# Image-text retrieval

**Fg Clip Large** · qihoo360 · Apache-2.0 · Multimodal Alignment · Transformers · English · Downloads: 538 · Likes: 3
FG-CLIP is a fine-grained vision-text alignment model that achieves both global and region-level image-text alignment through two-stage training, improving fine-grained visual understanding.

**Siglip2 So400m Patch14 384** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 622.54k · Likes: 20
SigLIP 2 is a vision-language model based on the SigLIP pre-training objective, integrating several additional techniques to improve semantic understanding, localization, and dense feature extraction.

**Siglip2 So400m Patch14 224** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 23.11k · Likes: 0
SigLIP 2 is an improved multilingual vision-language encoder based on SigLIP, enhancing semantic understanding, localization, and dense feature extraction capabilities.

**Siglip2 Large Patch16 512** · google · Apache-2.0 · Text-to-Image · Transformers · Downloads: 4,416 · Likes: 8
SigLIP 2 is an improved model based on SigLIP, integrating multiple techniques to enhance semantic understanding, localization, and dense feature extraction capabilities.

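The SigLIP 2 checkpoints above can be tried directly for zero-shot image classification through the Hugging Face `transformers` pipeline. A minimal sketch, assuming a recent `transformers` release with SigLIP 2 support, the Hub id `google/siglip2-so400m-patch14-384`, and a local image file `photo.jpg` (all assumptions, not part of the listing):

```python
from transformers import pipeline

# Zero-shot image classification with a SigLIP 2 checkpoint (assumed Hub id).
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-so400m-patch14-384",
)

# SigLIP-style models score each candidate label independently (sigmoid),
# so the scores need not sum to 1.
results = classifier(
    "photo.jpg",  # hypothetical local image path
    candidate_labels=["a photo of a cat", "a photo of a dog", "a photo of a car"],
)
print(results)
```
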
**Llm Jp Clip Vit Large Patch14** · llm-jp · Apache-2.0 · Text-to-Image · Japanese · Downloads: 254 · Likes: 1
A Japanese CLIP model trained with the OpenCLIP framework on a dataset of 1.45 billion Japanese image-text pairs, supporting zero-shot image classification and image-text retrieval tasks.

**Llm Jp Clip Vit Base Patch16** · llm-jp · Apache-2.0 · Text-to-Image · Japanese · Downloads: 40 · Likes: 1
A Japanese CLIP model trained with the OpenCLIP framework, supporting zero-shot image classification tasks.

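Because the llm-jp checkpoints are OpenCLIP models, they are typically loaded through the `open_clip` library rather than `transformers`. A minimal sketch, assuming the repository id `llm-jp/llm-jp-clip-vit-base-patch16`, that the checkpoint resolves through `open_clip`'s `hf-hub:` loader, and a local image `photo.jpg` (all assumptions):

```python
import torch
import open_clip
from PIL import Image

# Assumed Hub id; adjust to the actual repository name if it differs.
repo = "hf-hub:llm-jp/llm-jp-clip-vit-base-patch16"

model, preprocess = open_clip.create_model_from_pretrained(repo)
tokenizer = open_clip.get_tokenizer(repo)
model.eval()

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # hypothetical local image
texts = tokenizer(["猫の写真", "犬の写真"])  # Japanese prompts: "photo of a cat", "photo of a dog"

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    # Normalize embeddings, then rank labels by cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # zero-shot label probabilities for the image
```
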
**Siglip So400m Patch14 224** · google · Apache-2.0 · Text-to-Image · Transformers · Downloads: 6,654 · Likes: 53
SigLIP is a multimodal model that improves on CLIP by using a sigmoid loss function; it was pre-trained on the WebLI dataset and is suitable for tasks such as zero-shot image classification and image-text retrieval.

**Siglip So400m Patch14 384** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 6.1M · Likes: 526
SigLIP is a vision-language model pre-trained on the WebLI dataset, employing an improved sigmoid loss function to optimize image-text matching tasks.

**Siglip Large Patch16 384** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 245.21k · Likes: 6
SigLIP is a multimodal model pre-trained on the WebLI dataset, using an improved sigmoid loss function, suitable for zero-shot image classification and image-text retrieval tasks.

**Siglip Large Patch16 256** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 24.13k · Likes: 12
SigLIP is a vision-language model pre-trained on the WebLI dataset, using an improved sigmoid loss function to enhance performance.

**Siglip Base Patch16 512** · google · Apache-2.0 · Text-to-Image · Transformers · Downloads: 237.79k · Likes: 24
SigLIP is a vision-language model pre-trained on the WebLI dataset, using an improved sigmoid loss function, excelling in image classification and image-text retrieval tasks.

**Siglip Base Patch16 384** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 2,570 · Likes: 10
SigLIP is a multimodal model pre-trained on the WebLI dataset, employing an improved sigmoid loss function, suitable for zero-shot image classification and image-text retrieval tasks.

**Siglip Base Patch16 256** · google · Apache-2.0 · Text-to-Image · Transformers · Downloads: 12.71k · Likes: 5
SigLIP is a vision-language model pre-trained on the WebLI dataset, employing an improved sigmoid loss function, excelling in image classification and image-text retrieval tasks.

**Clip Flant5 Xl** · zhiqiulin · Apache-2.0 · Text-to-Image · Transformers · English · Downloads: 13.44k · Likes: 2
A vision-language generative model built on google/flan-t5-xl and fine-tuned for image-text retrieval tasks.

**Clip Flant5 Xxl** · zhiqiulin · Apache-2.0 · Image-to-Text · Transformers · English · Downloads: 86.23k · Likes: 2
A vision-language generative model fine-tuned from google/flan-t5-xxl, designed specifically for image-text retrieval tasks.

**Siglip Base Patch16 224** · google · Apache-2.0 · Image-to-Text · Transformers · Downloads: 250.28k · Likes: 43
SigLIP is a vision-language model pre-trained on the WebLI dataset, using an improved sigmoid loss function to optimize image-text matching tasks.

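For image-text retrieval with any of the SigLIP checkpoints above, image and text embeddings can be compared directly; the pairwise logits pass through a sigmoid rather than a softmax, matching the training objective. A minimal sketch with `transformers`, assuming the `google/siglip-base-patch16-224` checkpoint and a hypothetical local image `photo.jpg`:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("photo.jpg")  # hypothetical local image
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

# SigLIP was trained with fixed-length text, so max_length padding is recommended.
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One independent match score per (image, text) pair, mapped to [0, 1] by a sigmoid.
probs = torch.sigmoid(outputs.logits_per_image)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{p:.3f}  {text}")
```
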
**CLIP Convnext Xxlarge Laion2b S34b B82k Augreg Rewind** · laion · MIT · Text-to-Image · Downloads: 63 · Likes: 2
A CLIP ConvNeXt-XXLarge model trained on the LAION-2B dataset with the OpenCLIP framework, focused on zero-shot image classification tasks.